C/C++ Users Group Library 1996 July

Text File | 1989-04-21 | 80KB | 2,047 lines

gAWK Documentation
Feb 10, 1989 - Bob Withers


INTRODUCTION

This document is intended as a description of the AWK language as implemented in gAWK, a public domain program which originated with the GNU project. It is not intended as an all-inclusive training document; please see the references section for material that meets this need.

AWK is a pattern matching language which may be used to create programs which manipulate ASCII data files. AWK derives some of its features from SNOBOL and some from the 'C' language. The basic AWK program consists of a series of patterns and associated actions. Each input record is tested with each pattern in the program and the actions associated with those that match are executed. The format for an AWK program is as follows:

    pattern { action }
    pattern { action }

AWK input is generally processed by an "implicit input loop" which was borrowed from the SNOBOL language. AWK reads input records from the specified files, breaks them into fields based upon program controllable delimiters, and matches them against the patterns in the AWK program. Each pattern which is TRUE for the current record has its associated action statements executed.

The fields created for each record are given special variable names and may be used by the AWK program. The special variable $0 is used to reference the entire input record in exactly the format it was read. $1 refers to the first field of the record, $2 the second, and so on. For example, suppose AWK was breaking fields apart based on a comma delimiter. The record:

    Now,is the, time, for all good men

would be parsed as follows:

    $0 = "Now,is the, time, for all good men"
    $1 = "Now"
    $2 = "is the"
    $3 = " time"
    $4 = " for all good men"

Special builtin AWK variables provide information about the parsing of input lines and allow programs to override the default processing.
After each input record is parsed into fields the builtin variable NF is set to the number of fields in the record. In the above example NF would be set to 4.

Two builtin variables control the way AWK parses input files into records and fields. The RS (Record Separator) builtin variable is used by AWK to determine the delimiter for records. It may be set to any single character and is by default set to the newline character ("\n"). The variable FS (Field Separator) is used by AWK to determine how fields within records are parsed. Until recently FS was also restricted to a single character value. The current Unix version of AWK (called nawk) has greatly enhanced the use of the FS variable and these enhancements are supported in this version of gAWK. Rather than having FS represent a single character field delimiter, gAWK treats the contents of FS as a regular expression. The default value of FS in gAWK is "[ \t]+" which means that fields are delimited by one or more blanks or tabs (whitespace). For most input files this default is acceptable, but both the FS and RS variables may be overridden either on the AWK command line or within an AWK program. More information is provided on both builtin variables and regular expressions later in this document.


AWK COMMAND LINE PARAMETERS

The format of the AWK command line is as follows:

    AWK [-Ffs] [-Rrs] {"program" | -f progfile} [datfile ...]

In the above command line brackets [ ] indicate an optional argument and braces { } indicate a choice.

The optional -F switch may be used from the command line to override the default value of the FS builtin variable used to parse input records into fields. Under both MSDOS and OS/2 it is best to enclose the -F switch within double quotes if it contains spaces or special characters. For example, to parse input fields delimited by commas, semicolons, and colons one might code the -F switch as "-F[,;:]".
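
The field splitting described so far can be tried out with a short sketch. It assumes a POSIX-style awk run from a Unix shell rather than the MSDOS or OS/2 port documented here, so the program is enclosed in single quotes instead of double quotes. The variable names comma and multi and the second sample record are made up for illustration.

```shell
# Assumption: a POSIX-style awk named "awk" is on the PATH.

# The comma-delimited record from the introduction: four fields, and the
# leading blank of the third field is kept.
comma=$(printf 'Now,is the, time, for all good men\n' |
    awk -F, '{ print NF "[" $3 "]" }')
echo "$comma"

# FS given as a regular expression: split on commas, semicolons, or colons.
multi=$(printf 'a,b;c:d\n' |
    awk -F'[,;:]' '{ print NF ": " $1 "|" $2 "|" $3 "|" $4 }')
echo "$multi"
```

The brackets printed around $3 make the retained leading blank visible.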

The optional -R switch can be used to override the default value for the RS builtin variable. If, for example, records are to be delimited by an at sign (@) we could code the -R switch as -R@.

In general these command line switches are seldom used. The AWK language provides a means to override these variables within the program and this is generally preferable to having to remember to place the correct value on the command line.

The actual statements of the AWK program are supplied either on the command line or in an ASCII text file. Providing the AWK program on the command line is very popular in the Unix environment; however, due to limitations of the command line length under MSDOS and OS/2 it is practical only for very short programs. The following AWK program is supplied on the command line and will print all the records in the file MYFILE.DAT:

    AWK "{ print $0 }" MYFILE.DAT

It is more common for a program to be placed in an ASCII file and specified on the command line via the -f switch. The recommended file name extension for these files is .AWK. If the above program were placed in the file MYPROG.AWK the following command line would perform the same function as the previous one:

    AWK -f myprog.awk myfile.dat

The file(s) to be operated upon follow the switches and/or AWK program on the command line. Any number of files may be specified and the normal MSDOS and OS/2 wildcard characters may be used to include all matching file names. The files are processed in the order they are listed on the command line.

Special command line assignment statements may also be included within the file name list of the command line. These assignments take place in the order they appear on the command line. This feature may be used to provide information to the AWK program relative to the files being processed. The format of these assignment statements is variable=value and they are restricted only by the limits of the command line length.
Again, if the value contains spaces or special characters it is best to enclose the entire assignment within double quotes to instruct the operating system shell to parse it as a single argument to AWK. Following is an example that uses the variable "p" to tell the AWK program the number of the file currently being processed:

    AWK -f myprog.awk p=1 file1.dat p=2 file2.dat p=3 file3.dat

In the above execution the program MYPROG.AWK can refer to the variable "p" to determine which file is being processed. "p" is set to 1 before processing begins on FILE1.DAT. It is set to 2 when FILE1.DAT is closed and before FILE2.DAT is opened, and so on. There are better methods built into AWK to determine this information but this example illustrates the feature of command line assignments.


REGULAR EXPRESSIONS

Many useful programs can be written with AWK without the use of regular expressions; however, they are one of the most powerful features of the language. We will therefore take a short detour into a discussion of regular expressions before looking at the pattern matching features of AWK.

A regular expression is a notation for specifying a pattern for matching strings. Regular expressions contain characters which have special meaning and may be considered operators just as plus (+) and minus (-) are arithmetic operators in most languages. These special characters are called metacharacters. Following are the regular expression metacharacters supported by AWK:

    \ ^ $ . [ ] | ( ) * + ?

A regular expression in AWK is surrounded by forward slash characters and does not have to contain any metacharacters. A regular expression without metacharacters matches itself. The regular expression /ABC/ will match any string that contains the substring "ABC". Note that the match is case sensitive and will not match the substring "ABc".
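
As a small illustration of a regular expression without metacharacters, the following sketch counts the records containing "ABC". It assumes a POSIX-style awk on a Unix shell; the three input lines and the variable name hits are invented for the example.

```shell
# Assumption: a POSIX-style awk named "awk" is on the PATH.
# Count the records that contain the substring "ABC". The match is case
# sensitive, so the record containing "ABc" is not counted.
hits=$(printf 'xxABCxx\nxxABcxx\nABC\n' | awk '/ABC/ { n++ } END { print n }')
echo "$hits"
```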

The following table describes the format of regular expressions where "c" is a non metacharacter, "m" is a metacharacter, and "r" is a regular expression:

    c           Matches the non metacharacter c
    \m          Treats metacharacter m as a literal character
    ^           Forces match to the beginning of the string
    $           Forces match to the end of the string
    .           Matches any single character
    [ccc]       Matches any single character in the class
    [^ccc]      Matches any single character not in the class
    [c-c]       Matches the range of characters specified
    [^c-c]      Matches any character not in the range specified
    r | r       Matches any string that matches either expression
    (r1)(r2)    Matches a string that matches r1 and is immediately
                followed by a string that matches r2
    (r)*        Matches zero or more consecutive strings matched by r.
                AWK matches the longest string possible.
    (r)+        Matches one or more consecutive strings matched by r.
                AWK matches the longest string possible.
    (r)?        Matches zero or one occurrence of the string matched
                by r.

As we've already seen, a regular expression that contains no metacharacters matches itself. If this were the extent of features offered, regular expressions would be of little use. It is the metacharacters or "operators" which provide the power of regular expressions. We will look at each of the metacharacters, describe how they are used, and give some examples.

The "literal" metacharacter \ is used to remove the special properties associated with a metacharacter so that it can be matched as a normal character. To match a string containing a dollar sign we could code the regular expression /\$/ which would do the job. Likewise, to match the letter A followed by a backslash followed by the letter B we could code the regular expression /A\\B/.

The literal metacharacter is also used to give special meaning to otherwise normal characters. These special characters were inherited from the 'C' language and should be familiar.
They are:

    \b      backspace character
    \f      formfeed character
    \n      newline character
    \r      carriage return character
    \t      tab character
    \ddd    octal value ddd where ddd is 1 to 3 digits between 0 and 7

The match beginning of line metacharacter forces a match to occur at the beginning of a line. The symbol used is the caret (^). To match all lines which begin with a Z we could code /^Z/. Note that the caret only has this meaning at the beginning of a regular expression (and within character classes as we'll see shortly). A caret used elsewhere within a regular expression is treated as a normal character, although it is prudent to use the backslash literal metacharacter anyway if that is the intent. For example, the regular expression /AB^/ should match the same strings as /AB\^/.

The match end of line ($) metacharacter is similar to the caret operator, only it forces the match to the end of the line. Matching all lines which end with a question mark could be coded as /\?$/. Note that since the question mark is a metacharacter its use as a literal must be "quoted" by the literal metacharacter. Let's look at some examples using both the caret and the dollar sign:

    /^XX$/      Matches strings which consist of only the two
                characters "XX"
    /^.$/       Matches strings which are exactly one character
    /^\.$/      Matches strings which are exactly one character and
                are equal to a period (compare this with the
                previous example)

The period (.) metacharacter, as seen in the above examples, matches any single character. Therefore the regular expression /A..B/ will match any string which has a capital letter A and a capital letter B separated by any two other characters.

The bracket metacharacters [ ] are used to define character classes. A character class can be used to match a single character but allows alternatives to be supplied. To match any string which contains the letter A or the letter B we could code /[AB]/. To match any string that starts with an A or a B we code /^[AB]/.
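
The anchors and the character class can be exercised with the sketch below. It assumes a POSIX-style awk on a Unix shell; the five input records and the variable names (input, starts, exact, dots) are invented for illustration.

```shell
# Assumption: a POSIX-style awk named "awk" is on the PATH.
input='Apple
Banana
Cherry
XX
.'

# Records that start with an A or a B, joined with commas.
starts=$(printf '%s\n' "$input" |
    awk '/^[AB]/ { s = s sep $0; sep = "," } END { print s }')
echo "$starts"

# Records that consist of only the two characters "XX".
exact=$(printf '%s\n' "$input" | awk '/^XX$/ { n++ } END { print n }')
echo "$exact"

# Records that are exactly one period.
dots=$(printf '%s\n' "$input" | awk '/^\.$/ { n++ } END { print n }')
echo "$dots"
```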

If a character class begins with a caret the operation is negated, i.e. the expression matches characters that are not part of the class. To match strings which begin with anything other than an A or a B we could code /^[^AB]/. Don't confuse the beginning of line metacharacter with the character class negation character. A caret appearing anywhere else within a character class is treated as a literal character; /[A^B]/ will match strings containing either an A, a B, or a caret.

Character classes allow a range of characters to be specified by using a dash to separate the first character of the range from the last. Matching a string containing any lower case letter could be coded as /[a-z]/ which is much easier than having to enumerate all twenty-six letters. Multiple ranges may be specified and combined with single letter values. The regular expression /[A-CXYI-K]/ will match a string containing any of the following characters: A, B, C, X, Y, I, J, K. Expressions containing ranges may also be negated as in /[^ABJ-K]/.

The next metacharacter is the alternation or "OR" operator. This operator allows an expression to match if any of its subexpressions match. The expression /A|B/ will match any string containing either A or B.

Parentheses are used to group expressions to override the normal operator precedence. For example the expression /ABC|XYZ/ looks like it might match strings containing either ABC or XYZ. However, due to the higher precedence of the | operator it actually matches strings containing either ABCYZ or ABXYZ. To match strings containing either ABC or XYZ we must code the expression as /(ABC)|(XYZ)/. (Other AWK implementations give | the lowest precedence, in which case /ABC|XYZ/ already matches either ABC or XYZ; the parenthesized form behaves the same under either rule.)

We will treat the last three metacharacters as a group and label them the "repeat operators". Technically they are known as the closure operators and their function is to allow a subexpression to be repeated.

The * metacharacter repeats a subexpression zero or more times. The expression /A*/ matches the strings "", "A", "AA", "AAA", and so on.
Likewise /AB*/ matches "A", "AB", "ABB", etc. Parentheses may be used to repeat more than a single character as in /(AB)*/ which would match "", "AB", "ABAB", etc.

The + metacharacter is similar to the * but it will not match the NUL string "" (zero repeats) as * does. The expression /[ABC]+/ will match one or more consecutive characters in the set ABC as in "A", "B", "CA", "CCCBA", etc.

The final metacharacter is the question mark and is used to match exactly zero or one occurrence of the expression. The expression /AB?/ will match "A" or "AB".

In AWK all of the repeat metacharacters will match the largest possible substring; therefore given the string "AAAAAAAAAA" and the regular expression /A+/ the entire string will be matched rather than just the first character.


PATTERNS

Patterns in AWK are used to select particular input records for a specific type of processing. They are conditional expressions which cause their associated action to be performed if they are TRUE. Following are the types of patterns supported:

    BEGIN           Special pattern which is performed before the
                    first input file is opened.
    END             Special pattern which is performed after the
                    last input file has been processed.
    expression      Action is executed for each input line where
                    "expression" is TRUE.
    /reg exp/       Action is executed for each input line that is
                    matched by the regular expression.
    compound pat    A compound pattern is comprised of several
                    patterns connected by the boolean operators
                    && (AND), || (OR), ! (NOT), and parentheses.
    pat1, pat2      A range pattern matches each input line starting
                    with one matched by "pat1" up to and including
                    one matched by "pat2".
    empty           The empty pattern consists of only an action.
                    The pattern is unconditionally TRUE and the
                    action is executed for every input record.

The BEGIN and END special patterns are not used to match input lines but rather are used to perform program initialization and termination.
The action associated with the BEGIN pattern is executed before AWK reads any input records. It can be used to initialize variables, print headings, or set AWK builtin variables which control input and output field splitting. The END special pattern is matched after all input files have been processed. It can be used to perform cleanup or print accumulated totals. For example, the following AWK program counts the number of input lines and uses the END pattern to print out the result:

    { ++cnt }
    END { print cnt, "records were read" }

The first pattern/action pair in this program adds one to the variable "cnt" for each input record processed by AWK. It consists of only an action, making use of the "empty" pattern to match all input records. After all input records are processed, the END pattern/action pair is executed and prints the accumulated value of "cnt".

The "expression" pattern is a conditional expression which, if TRUE, will cause the action associated with it to be executed. AWK has a rich set of comparison operators which may be used in conjunction with builtin variables, program defined variables, and/or AWK field variables. The following table presents the comparison operators supported by this version of AWK:

    <       Less than
    <=      Less than or equal to
    ==      Equal to
    !=      Not equal to
    >=      Greater than or equal to
    >       Greater than
    ~       Matched by
    !~      Not matched by

If we wanted to process input records which contained more than 5 fields we could make use of the NF builtin variable to construct a pattern that would match these records: NF > 5. AWK conditional expressions can also contain arithmetic or string operators. If our input data had employee hourly rate in field #1 and number of hours worked in field #3 then the pattern $1 * $3 > 100 would select input records where the employee's pay is greater than $100.00.
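
The record counter and the two expression patterns above can be run as follows. This is only a sketch, assuming a POSIX-style awk on a Unix shell; the pay records (rate, name, hours), the other sample input, and the variable names count, wide and highpay are invented.

```shell
# Assumption: a POSIX-style awk named "awk" is on the PATH.

# Empty pattern plus END: count the input records.
count=$(printf 'a\nb\nc\n' |
    awk '{ ++cnt } END { print cnt, "records were read" }')
echo "$count"

# Expression pattern on a builtin variable: records with more than 5 fields.
wide=$(printf '1 2 3 4 5 6\n1 2 3\n' | awk 'NF > 5 { n++ } END { print n }')
echo "$wide"

# Arithmetic inside a pattern: hourly rate in field 1, hours in field 3.
highpay=$(printf '12.50 Smith 10\n8.00 Jones 5\n' |
    awk '$1 * $3 > 100 { print $2 }')
echo "$highpay"
```

Only the first pay record qualifies, since 12.50 * 10 is 125 while 8.00 * 5 is 40.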

Most of the comparison operators used in AWK are similar to those available in other high level languages and should be readily understood. The match operators found in AWK are not quite as common and deserve some explanation. These operators are used to match an expression against a regular expression. The tilde (~) is the match operator and can be negated by use of the exclamation mark (!~). For example, if we wanted to print records where the 5th field contained the string "Jones" we could code the following program:

    $5 ~ /Jones/ { print $0 }

This program will use the literal regular expression specified as the second argument of the match operator to compare against the expression which is the left argument. If a match is found the pattern is TRUE and the action is executed. Likewise printing all records which did not contain the string "Jones" in field 5 would be coded as:

    $5 !~ /Jones/ { print $0 }

Note that the match operation is a regular expression search. If field five contained the string "Where is Jones?" the regular expression /Jones/ would match it. If an exact match is desired use the equality operator as in:

    $5 == "Jones" { print $0 }

The match operator supports a new AWK feature called "dynamic regular expressions". This feature allows the value of an expression to be compiled as a regular expression and used as such. The value of this expression must be a valid regular expression or a run time error will occur. Consider the pattern "$1 ~ $5" which instructs AWK to treat the value of field #5 as a regular expression and use it to match the contents of field #1. For each input record field #5 could be a different regular expression. Our program to search for the string "Jones" in field #5 could be coded as:

    BEGIN { str = "Jones" }
    $5 ~ str { print $0 }

Use of dynamic regular expressions requires AWK to syntax check and compile the expression each time it is used.
For this reason dynamic regular expressions are not as efficient as literal regular expressions, which are checked and compiled only once. They are, however, very powerful and are well worth the slight performance degradation if your application needs them.

There is a case of regular expression matching which occurs so frequently that AWK provides a special shorthand notation. The pattern "$0 ~ /Jones/" will match the regular expression against the entire input record and evaluate as TRUE if there is a match. This format of the match operator can be shortened to simply specifying the regular expression. The following program will print all records which contain the string "Jones":

    /Jones/ { print $0 }

Compound Patterns

A compound pattern is an expression which uses logical operators to combine other patterns. The available logical operators are AND (&&), OR (||), and NOT (!).

    $1 == "Jones" && NF > 10

The above program will print each input record where the first field is equal to the string "Jones" AND the number of fields in the record is greater than ten. Note that we have omitted the action portion of the program. If a pattern is present the action may be omitted, in which case the default action is performed, which is equivalent to { print $0 }.

    $1 == "Jones" || !(NF > 10)

The above program will print all input records where the first field is equal to the string "Jones" OR the number of fields in the record is less than or equal to ten (take a good look at it).

Range Pattern

The range pattern is a special construct which can be used to match a series of input records. The format is "pat1, pat2" where pat1 and pat2 are patterns, most often regular expressions. The pattern will return TRUE when pat1 matches an input line and continue to be TRUE up to (and including) an input line which matches pat2.
For example:

    /Jones/, /Sampson/

This program will print all input records beginning with one matching the string "Jones" and continuing up to and including a record that matches "Sampson".

Summary of Patterns

    Pattern       Example               Matches
    BEGIN         BEGIN                 Before any input is read
    END           END                   After all input has been read
    expression    $1 > 50               Lines with the first field
                                        greater than 50
    matching      /Jones/               Lines that contain the
                                        substring "Jones"
    compound      $1 < 5 && $1 > 0      Lines where the first field is
                                        greater than 0 and less than 5
    range         NR == 1, NR == 20     The first 20 input records


ACTIONS

The action portion of an AWK program defines the statements to be executed when the pattern associated with them is found to be TRUE for the current input record. As we've seen, the action portion can be omitted, in which case the default action of printing the matching record is performed. The pattern portion of a statement may also be omitted, which creates a pattern that will match all input records. However, the pattern and the action cannot both be omitted; at least one must be present.

The statements supported by AWK in the actions section are similar to the constructs of the 'C' Language. Following are the allowable statements; capital letters indicate the portions of a statement which vary:

    print EXPRESSION-LIST
    printf(FORMAT, EXPRESSION-LIST)
    if (EXPRESSION) STATEMENT
    if (EXPRESSION) STATEMENT else STATEMENT
    while (EXPRESSION) STATEMENT
    do STATEMENT while (EXPRESSION)
    for (EXPRESSION; EXPRESSION; EXPRESSION) STATEMENT
    for (VARIABLE in ARRAY) STATEMENT
    delete ARRAY-ELEMENT
    break
    continue
    next
    exit
    { STATEMENTS }
    VARIABLE = EXPRESSION

Expressions

Expressions in AWK can consist of constants, variables, builtin variables, field variables, arithmetic expressions, string expressions, conditional expressions, relational expressions, builtin functions, or user defined functions. We will look at each of these in turn.

Expressions - Constants

AWK supports two data types which are NUMBER and STRING. String constants are written surrounded by double quotes and may contain "escape characters" as used in 'C' Language strings. For example, to create a string literal which contains the single character double quote we would code "\"". Other examples of string constants are "Jones", "Hello, World", and "" which is the NUL string.

Number constants are real numbers and are written without quotes. Numbers may be written as integers (556), decimal numbers (5.17), or in exponential notation (5.17E-2). All numbers are stored in floating point which, in this implementation, uses the 'C' type double.

Expressions - Variables

User defined variables in AWK are created when they are first referenced. The programmer does not need to specify the type of data the variable will store; AWK infers this from the operations performed on the variable. In fact the type of data may change during the execution of the program and AWK will convert the current contents of the variable to the required type. All variables are created empty. In the case of string variables they contain the NUL string and in the case of number variables they contain the number zero.

Each user defined variable name is composed of letters, digits, and underscores and must not begin with a digit. Examples are: total_count, sum, and my_var.

Expressions - Builtin Variables

AWK contains a number of builtin variables which may be used to obtain information and/or control the operation of reading and splitting fields. All builtin variable names are spelled with all capital letters.
Following is a list of supported builtin variables:

    Variable    Meaning
    ARGC        Number of command line arguments
    ARGV        Array of command line arguments
    FILENAME    Name of the current input file
    FNR         Record number within the current file
    FS          Input field separator (reg exp)
    NF          Number of fields in the current record
    NR          Record number of current record relative to start
                of execution
    OFMT        Output format for numbers
    OFS         Output field separator (string)
    ORS         Output record separator
    RLENGTH     Length of string matched by match() function
    RS          Input record separator
    RSTART      Start of string matched by match() function
    SUBSEP      Subscript separator

Following are the default values of these builtin variables:

    Variable    Default
    ARGC        Varies
    ARGV        Varies
    FILENAME    Varies
    FNR         Varies
    FS          "[ \t]+"
    NF          Varies
    NR          Varies
    OFMT        "%.6g"
    OFS         " "
    ORS         "\n"
    RLENGTH     0
    RS          "\n"
    RSTART      0
    SUBSEP      "\034"

The builtin variables may be used just like user defined variables. For example, the following program will count the number of input files and display this value at the end of processing:

    prev != FILENAME { ++no_files; prev = FILENAME }
    END { print no_files, "file(s) input" }

The user defined variable "prev" is created and initialized to the NUL string and will therefore not be equal to the first filename processed. When this happens the variable "no_files" is incremented and the value of "prev" is set equal to the current filename. At the end of input the number of different files encountered is displayed.

Expressions - Field Variables

As discussed previously, AWK splits input records into fields based on the regular expression contained in the builtin variable FS. These fields may be accessed or modified by the AWK program by field number. Fields are numbered beginning from one (1). The dollar sign ($) specifier is used to inform AWK that an expression refers to a field. For example, $1 refers to the first field in a record and $5 refers to the fifth field.
The special field variable $0 is used to refer to the entire input record just as it was read in by AWK.

The expressions used to specify field variables do not need to be numeric constants but can be any numeric expression. Given that the builtin variable NF contains the number of fields in the current record, the variable $(NF - 1) refers to the next to the last field. Assume that an AWK program was to print out the value of a single field for each input record and that the number of the field to be printed was contained in the first field of each record. The following AWK program would meet this specification:

    { print $($1) }

This version of AWK permits assignments to field variables. If a single field is assigned a new value the contents of the $0 variable are modified accordingly. If a new value is assigned to the $0 variable all field variables are recalculated and a new value is assigned to NF.

Expressions - Arithmetic Expressions

AWK provides the usual arithmetic operators which may be used to calculate numeric results. All arithmetic is performed in floating point using double precision storage. Following are the individual operators supported:

    Operator    Function               Example
    +           Addition               $1 + $2
    -           Subtraction            total - $4
    -           Unary minus            -total
    *           Multiplication         x * y
    /           Division               $1 / x
    %           Modulo (remainder)     x % y
    ^           Exponentiation         $1 ^ 5
    ++          Pre/Post increment     ++x or x++
    --          Pre/Post decrement     --x or x--

Expressions - String Expressions

There is only one string operator supported by AWK. It is concatenation and is represented by spaces between variables and/or constants. The following program assigns some constants to string variables and then concatenates them into a single variable:

    BEGIN {
        x = "String 1"; y = "String 2"
        z = "(" x ":" y ")"
        print z
        exit
    }

The output of this program will be:

    (String 1:String 2)

While discussing string expressions seems like a good opportunity to bring up AWK's use of dynamic regular expressions.
A dynamic regular expression in AWK is simply a string variable which is treated as a normal regular expression. Strings which contain valid regular expressions can be used anywhere that a literal regular expression can be used. For example, the following program makes use of a dynamic regular expression to print input records which consist solely of digits, i.e. unsigned integer numbers:

    BEGIN { num = "^[0-9]+$" }
    $0 ~ num

Notice that the action portion of the second rule of this program is missing. A missing action performs the default action of printing the input record when the pattern is TRUE.

The astute reader will have observed that AWK's builtin variable FS is nothing more than a dynamic regular expression which is used to delimit fields within input records.

Expressions - Conditional Expressions

The AWK conditional expression has the form:

    exp1 ? exp2 : exp3

Exp1 is evaluated and if the result of it is TRUE (nonzero or nonNUL) the value of the conditional expression is the value of exp2. If exp1 is FALSE then the value of the conditional expression is the value of exp3. Consider the following AWK program fragment:

    END { print tot, "file" (tot == 1 ? "" : "s"), "read" }

Presumably the variable "tot" was calculated during the course of the program and represents the number of files read. The END action intends to print out this number. We make use of a conditional expression in this action to make the word "file" singular if there was only one file read, otherwise we make it plural by adding an "s". The conditional expression is enclosed in parentheses so that it is evaluated before being concatenated to "file". Notice that we use the string concatenation operator to append the "s" to the literal "file" during printing to avoid having a field separator placed between them.

Expressions - Relational Expressions

Relational expressions consist of expressions formed using the AWK comparison operators. These expressions have either a TRUE (1) or FALSE (0) value.
Following are the comparison operators supported by AWK:

    Operator    Meaning                     Example
    <           Less than                   x < y
    <=          Less than or equal to       x <= y
    ==          Equal to                    x == y
    !=          Not equal to                x != y
    >=          Greater than or equal to    x >= y
    >           Greater than                x > y
    ~           Is matched by               x ~ y
    !~          Is not matched by           x !~ y

Relational expressions may be combined by using the logical operators && (AND), || (OR), and ! (NOT).

Expressions - Builtin Functions

The functions built into AWK may be divided into two categories: arithmetic and string. The following tables list the available functions in each category. The notation used to represent the type of function arguments is:

    x, y  ==>  Numbers
    s, t  ==>  Strings
    r     ==>  Regular Expression
    a     ==>  AWK array variable

Arithmetic Builtin Functions

    Function            Value Returned
    atan2(x,y)          arctangent of x/y
    cos(x)              cosine of x, with x in radians
    exp(x)              exponential function of x, e ^ x
    int(x)              integer part of x
    log(x)              natural (base e) logarithm of x
    rand()              random number n, where 0 <= n < 1
    sin(x)              sine of x, with x in radians
    sqrt(x)             square root of x
    srand(x)            seed random number generator with x

String Builtin Functions

    gsub(r,s)           substitute s for r globally in $0, return the
                        number of substitutions made
    gsub(r,s,t)         substitute s for r globally in string t,
                        return the number of substitutions made
    index(s,t)          return first position of string t in string s
                        or 0 if t is not present
    length(s)           return the number of characters in s
    lower(s)            return string s with all upper case letters
                        converted to lower case
    match(s,r)          test if string s contains a substring matched
                        by regular expression r, return index of match
                        or 0 if none; sets builtin variables RSTART
                        and RLENGTH
    reverse(s)          return the string s reversed
    split(s,a)          split string s into array a on FS, return
                        number of fields
    split(s,a,r)        split string s into array a on regular
                        expression r, return number of fields
    sprintf(f,exp,...)  similar to the C sprintf function.
                         string f is a format specifier and the
                         expression list is used to "fill in" the
                         % placeholders. the return value is the
                         resultant string
     sub(r,s)            substitute s for the leftmost longest
                         substring of $0 matched by r, return the
                         number of substitutions made (0 or 1)
     sub(r,s,t)          substitute s for the leftmost longest
                         substring of t matched by r, return the
                         number of substitutions made (0 or 1)
     substr(s,x)         return the suffix of s starting at
                         position x
     substr(s,x,y)       return substring of s starting at
                         position x for length y
     system(s)           invoke an operating system command shell
                         and execute string s as a command
     upper(s)            return string s with all lower case
                         letters converted to upper case

Expressions - User Defined Functions

User defined functions are not supported in this version of AWK. Support for this feature is currently under construction and will be available in the next release of the software.

Statements

The AWK statements define the actions to be performed upon variables and expressions. The available statements are very "C like" in both syntax and semantics. The types of statements supported are listed in the introduction to the ACTIONS section. AWK statements may be terminated by a semicolon; however, this is only required if more than one statement appears on a single line. For example:

     BEGIN { FS = "\t"; OFS = ","; }

In this example the semicolon following the first assignment statement is required; however, the second (or last) semicolon may be omitted. We will now take a closer look at each of the statement types.

Statements - print

The "print" statement is used to produce simple output from one or more expressions. Each expression to be printed is separated by a comma. If desired, the expression list may be surrounded by parentheses. Each comma separated expression is printed as an output field. Fields in the output record are separated by the value contained in the OFS builtin variable.
The last expression in the print statement is terminated by the "record separator" value contained in the ORS builtin variable. String expressions are converted for output via the "%s" format specifier. Numeric expressions are converted for output by using the format specifier contained in the OFMT builtin variable which defaults to "%.6g". This value can be changed by the program to alter the format of numeric fields.

The following example uses the "print" statement to process a comma delimited input file containing five fields while exchanging the positions of the second and third fields:

     BEGIN { FS = OFS = "," }
           { print($1, $3, $2, $4, $5) }

The output of the print statement will be directed to the standard output device (stdout) by default. The program may override this default by use of the AWK redirection operator to place the output in a file or on a printer.

     print "This will be written to file XYZ.DAT" >"XYZ.DAT"

     outfile = "XYZ.DAT"
     print "This will be written to file XYZ.DAT" >outfile

     print "This will go to the printer" >"PRN"

Statements - printf

The "printf" statement in AWK is very similar to its counterpart in the 'C' language. The first parameter of the printf statement is a string containing "format specifiers" which determine how the remaining parameters are formatted and printed. The format string is always required; additional parameters are required based on the number of specifiers in the format string. A format specifier has the following parts:

     %[-][0][width][.prec]char
     ! !  !    !      !    !
     ! !  !    !      !    +----> printf format ctrl char
     ! !  !    !      +---------> max string width or number
     ! !  !    !                  digits to right of decimal
     ! !  !    +----------------> minimum width for field
     ! !  +---------------------> pad with leading zeros
     ! +------------------------> left justify result
     +--------------------------> format string specifier

Items within square brackets ([ ]) are optional.
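The parts of a format specifier described above can be tried out with a short sketch such as the following. The sample values and the variable name "line" are only assumptions made for this illustration; sprintf() is used so that the formatted text can be held in a variable before it is printed.

```awk
BEGIN {
    # %-10s : left justify a string in a field at least 10 wide
    # %05d  : pad an integer with leading zeros to a width of 5
    # %8.2f : minimum width 8 with 2 digits right of the decimal
    # %.3s  : print at most 3 characters of the string
    line = sprintf("[%-10s] [%05d] [%8.2f] [%.3s]", "Jones", 42, 3.14159, "abcdef")
    print line
}
```

This should print [Jones     ] [00042] [    3.14] [abc]. The square brackets in the format string are literal text, included only to make the padding visible.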
The following table lists the valid printf format control characters:

     Character   PRINTF Expression
     c           ASCII character
     d           decimal integer
     e           [-]d.ddddddE[+-]dd
     f           [-]ddd.dddddd
     g           e or f format whichever is shorter
     o           unsigned octal number
     s           string
     x           unsigned hexadecimal number
     %           literal % character

As is the case with the "print" statement the output of the "printf" statement may be redirected via the AWK redirection operator (>). One difference from the "print" statement is that the "printf" statement requires the programmer to fully specify all field and record delimiters. The OFS and ORS builtin variables are not used with "printf" and must be supplied in the format string if so desired.

Statements - if

The AWK "if" statement is implemented in the same manner as is found in the 'C' language. The basic format is as follows:

     if (expression)
          statement1
     else
          statement2

If the expression is TRUE, statement1 is executed; otherwise statement2 is executed. The "else" portion is optional and need not be coded if there is no alternative action to take when "expression" is FALSE. Both statement1 and statement2 may be replaced by several statements if the statements are enclosed within curly braces:

     if ($1 == "Jones") {
          $2 = "Common Name"
          jones_cnt++
     }
     else
          $2 = "Uncommon Name"

Statements - while

The AWK "while" statement executes a statement or block of statements enclosed within curly braces as long as the supplied expression is TRUE. If the expression starts off being FALSE the statements are never executed. Following is the format of the "while" statement:

     while (expression)
          statement

Following is an example:

     i = NF
     while (i > 0) {
          print $i
          --i
     }

Statements - do

The "do" statement is similar to the "while" statement with the exception that the test of the expression is made after the statement has been executed.
For this reason the statement(s) within a "do" loop will always be executed at least one time even if the expression starts off being FALSE. The format of the "do" statement is:

     do
          statement
     while (expression)

Following is an example:

     i = NF
     do {
          print $i
          --i
     } while (i > 0)

In this example, what will happen if NF == 0?

Statements - for

The AWK "for" statement has two forms, one which should be familiar to 'C' programmers and one which should be familiar to SNOBOL programmers. The SNOBOL version allows looping through all the elements of an AWK array and we will defer discussion of this variant until we talk about associative arrays in AWK. The 'C' version of "for" has the following format:

     for (exp1; exp2; exp3)
          statement

This version of the "for" statement can best be described via the programming constructs from which it is comprised. Following is AWK language code which implements a "for" statement using constructs we have already covered:

     exp1
     while (exp2) {
          statement
          exp3
     }

In verbiage this means that exp1 is executed at the start of the loop one time. Then while exp2 is TRUE the statement associated with the "for" is executed followed by exp3. This loop continues until exp2 is FALSE. Note that if exp2 is FALSE at the beginning of the loop it is never executed. Following is an example of this type of "for" statement:

     for (i = NF; i > 0; --i)
          print $i

Looking back at our example in the discussion of the "while" statement you will note that this example performs the identical function.

Statements - delete

The "delete" statement removes an element of an associative array from memory. Again, we will defer discussion of this statement to the section on AWK arrays.

Statements - break

The AWK break statement is used to terminate one of the looping constructs prior to its normal termination. Use of the "break" statement outside of a loop is invalid.
The following examples demonstrate the use of "break":

     i = NF
     while (1) {
          if (i > 0)
               print $i
          else
               break
          --i
     }

     for (i = NF; 1; --i)
          if (i > 0)
               print $i
          else
               break

Statements - continue

The "continue" statement in AWK, as in 'C', is used within a loop to immediately return to the expression evaluation portion of the looping statement. In the case of a "while" or a "do" loop the loop expression is evaluated and the loop is continued or terminated based on its value. In the case of a "for" loop, exp3 is executed and then exp2 is evaluated to determine if the loop should terminate. In either case the remaining code in the loop is not executed during the current iteration. The following example prints out all fields of a record which contain valid integer numbers. The "continue" statement is used to skip the printing if the match for numeric value fails:

     for (i = 1; i <= NF; ++i) {
          if ($i !~ /^[0-9]+$/)
               continue
          printf("%d ", $i)
     }

Statements - next

The AWK "next" statement is used to terminate the processing of the current input record and continue the implied input loop with the next record to be processed. Recall that each input record is matched against every pattern in the program and, if TRUE, executes the corresponding action. If a particular action decides that the program should not continue processing a particular record the "next" statement can be used to discard the current record and proceed with the next one. The following example uses "next" to discard records that have fewer than five fields:

     NF < 5        { next }
     $6 == "Jones" { print "Record", NR, "is a Jones" }

Statements - exit

The "exit" statement can be used within an AWK action to terminate processing of the program before the end of input. The "exit" statement will terminate the implied input loop and execute the END action if the program has one. If the "exit" statement appears within the action associated with the END pattern it simply terminates the program.
The following program terminates processing after reading 20 input records:

     NR > 20 {
          print "Terminating execution"
          exit
     }

     { print "Processing record", NR }

     END { print "Done processing" }

Statements - assignment

The AWK assignment statement is similar to its 'C' counterpart. It is used to assign a new value to a variable. The AWK assignment statement supports all the 'C' variations such as:

     Operator    Format     Meaning
     =           x = y      x = y
     +=          x += y     x = x + y
     -=          x -= y     x = x - y
     *=          x *= y     x = x * y
     /=          x /= y     x = x / y
     %=          x %= y     x = x % y
     ^=          x ^= y     x = x ^ y

Builtin Functions

The Expressions section above presented a table of the functions built into AWK. We will now examine each of these functions in closer detail.

Builtin Functions - atan2(x, y)

This function calculates the arctangent of x / y. The return value is in the range -PI to PI. The signs of both arguments are used to determine the quadrant of the return value. The following example prints the arctangent of -1 / 1:

     print "Arctangent of -1 / 1 is:", atan2(-1, 1)

Builtin Functions - cos(x)

This function returns the cosine of its parameter x. The following example displays the cosine of PI:

     PI = 3.14159265359
     print "Cosine of PI is:", cos(PI)

Builtin Functions - exp(x)

This function returns the value of e raised to the x power. The following prints the value of e ^ 2.

     print exp(2)

Builtin Functions - gsub(r, s, t)

The gsub() function performs a global substitution of string s for each match of regular expression r in string t. If string t is omitted from the call $0 is used in its place. The regular expression supplied as r may be a literal regular expression or a string which is to be treated as a dynamic regular expression. The function returns the number of substitutions made. Following is an example:

     t = "It is the best time, isn't it?"
     cnt = gsub(/is/, "was", t)
     printf "Count(%d), Result(%s)\n", cnt, t

This code will print the following:

     Count(2), Result(It was the best time, wasn't it?)

Builtin Functions - index(s, t)

The index() function searches the string s for the substring t and returns the position of the first match or zero if t is not a substring of s. Following is an example:

     s = "It was the best of times"
     print index(s, "best"), index(s, "It"), index(s, "xyz")

This code will produce the following output:

     12 1 0

Builtin Functions - length(s)

This function will return the length of the string s in characters.

Builtin Functions - lower(s)

The lower() function converts all upper case letters in string s to lower case. It returns the converted string. This function is not included in Unix versions of AWK and is a gAWK extension.

     s = lower("NOW is The timE 1234")
     print s

This code will produce the following output:

     now is the time 1234

Builtin Functions - int(x)

The int() function returns a numeric value which is the largest integer not greater than x. The following examples demonstrate this function:

     print "This should print 2:", int(2.12345)
     print "This should print -5:", int(-4.5)

Note that standard Unix AWK instead truncates toward zero, so int(-4.5) yields -4 there.

Builtin Functions - log(x)

This function returns the natural logarithm of x. This function is undefined for negative values and will produce a run time error.

Builtin Functions - match(s, r)

The match() function searches string s for a match with regular expression r. It returns the position of the beginning of the match or zero if no match occurred. As a side effect it sets builtin variables RSTART and RLENGTH. RSTART is set to the beginning position of the match and RLENGTH is set to the length of the matched string.
Following are several examples:

     s = "I must be kind, only to be cruel"
     t = ".*"
     print match(s, /(kind)|(be)/), RSTART, RLENGTH
     print match(s, t), RSTART, RLENGTH
     print match(s, "none"), RSTART, RLENGTH

The following output is produced by this code:

     8 8 2
     1 1 32
     0 0 0

Builtin Functions - rand()

This function returns a pseudorandom number which is greater than or equal to zero but less than one. Refer to the srand() function for information on seeding the random number generator.

Builtin Functions - reverse(s)

This function returns its argument as a string in which all the characters are reversed. For example:

     print reverse("ABCDEF")

The above statement will produce the output FEDCBA. The reverse() function is a gAWK extension and is not available in Unix AWK.

Builtin Functions - sin(x)

This function returns the sine of its argument x. The following example prints the sine of PI / 2 which should be 1.0.

     PI = 3.1415926535
     print "Sine of PI / 2:", sin(PI / 2)

Builtin Functions - split(s, a, r)

The split() function is used to split a string "s" into fields in array "a" based upon a regular expression "r". The regular expression passed may be either a literal expression (/regexp/) or a dynamic expression ("regexp"). If "r" is omitted then the current value of the FS builtin variable is used. The split() function uses the regular expression to find field delimiters within the string. It then creates an associative array of fields and returns the number of fields (or array elements) created. For example, the following code will split a string delimited by commas and then print out each individual field in the string.
     str = "Now,is the,time,for all,good,men and women"
     flds = split(str, arr, /,/)
     print "The string contains", flds, "fields"
     for (i = 1; i <= flds; ++i)
          print "Field", i, "(" arr[i] ")"

The above code should produce the following output:

     The string contains 6 fields
     Field 1 (Now)
     Field 2 (is the)
     Field 3 (time)
     Field 4 (for all)
     Field 5 (good)
     Field 6 (men and women)

Builtin Functions - sprintf(fmt [,exp] ...)

The sprintf() function is very similar to its C language counterpart with the exception that the AWK sprintf() returns its resultant string rather than being passed a pointer to a buffer to place it in. The format string "fmt" is the only required argument and it may contain format specifiers as documented under the "printf" statement. The variable number of "exp" arguments passed should equal the number of print specifiers in the format string. The return value is the resultant string after applying the expression list to the format string as defined by the format specifiers. Following is an example:

     x = sprintf("Current filename is %s", FILENAME)
     print "(" x ")"

Builtin Functions - sqrt(x)

This function returns the square root of x. It is undefined for negative numbers and will produce a run time error.

Builtin Functions - srand(x)

The srand() function may be used to set a starting point for generating a series of pseudorandom numbers. It may be called with or without an argument. If an argument is passed that value is used to seed the random number generator. If no argument is passed the random number generator is seeded from the current time of day.

Builtin Functions - sub(r, s, t)

The sub() function is similar to the gsub() function but makes at most one substitution. sub() will substitute "s" for the leftmost substring of "t" which is matched by the regular expression "r". If "t" is omitted it is assumed to be $0.
The sub() function returns the number of substitutions made which will be either zero or one. The argument "r" may be either a literal or dynamic regular expression.

Builtin Functions - substr(s, x, y)

The substr() function returns the substring of "s" which begins at position "x" for a length of "y". The length argument "y" may be omitted in which case substr() returns the substring beginning at position "x" for the remainder of the string. If "x" is greater than the number of characters in string "s" a NUL string is returned. Following are some examples and the output they produce:

     STATEMENT                              OUTPUT
     print substr("ABCDEFGHIJK", 5)         EFGHIJK
     print substr("ABCDEFGHIJK", 5, 2)      EF
     print substr("ABCDEFGHIJK", 11, 1)     K
     print substr("ABCDEFGHIJK", 12, 1)

Builtin Functions - system(s)

The system() function will invoke a new command shell and execute the string "s" as a command under this child shell. The string passed may be a builtin MSDOS or OS/2 command such as DIR, or an external program file. The return value of the function is the return code of the command executed. The following example displays a sorted directory list using the SORT.EXE filter:

     BEGIN {
          fil = "$$$.tmp"
          system(sprintf("dir | sort >%s", fil))
          ARGV[1] = fil
          ARGC = 2
     }

     {
          if (" " == substr($0, 1, 1))
               next
          printf("%-16s %6d\n", $1 "." $2, $3)
     }

     END {
          system(sprintf("del %s", fil))
     }

Builtin Functions - upper(s)

The upper() function returns its argument string with all lower case letters converted to upper case. This function is a gAWK extension and is not available under Unix AWK.

SPECIAL AWK FEATURES

Associative Arrays

As we have hinted at during discussion of various other features, AWK supports arrays similar to the manner in which SNOBOL implements them. In AWK an array subscript is a string rather than a number as in most languages. It is, therefore, perfectly legal in AWK to reference arr["HI"] as an array element.
You should also note that this is not the same array element as defined by arr["hi"]. Array subscripts which are specified as numbers are converted to strings so arr["22"] and arr[22] refer to the same array element. In converting numbers to strings no leading zeros are added and since all subscript characters are significant arr["01"] and arr[1] do NOT refer to the same element.

Multidimensional arrays in AWK are created with the same notation as used in most languages, i.e. arr[i, j, k]; however, in AWK the multiple subscripts are concatenated together to form a single subscript. The value of the builtin variable SUBSEP is placed between each subscript value. If an array element is assigned a value with the statement arr["SUB1", "SUB2"] = "hi" it can also be referenced as arr["SUB1" SUBSEP "SUB2"]. The SUBSEP builtin variable is initialized to the octal value \034 (Ctrl-\); however, it can be changed by the programmer to any character or string which will allow multidimensional array elements to be unique.

AWK arrays are dynamically created and can be expanded or contracted at will. There is no need to declare a variable as an array; simply assigning it values as a subscripted variable is sufficient. The AWK "delete" statement may be used to remove elements from an array. The format of the delete statement is "delete arr-element" and it is written as "delete arr[1]" in AWK code. The delete statement removes the specified element from the array and frees all storage it occupied.

Associative Arrays - Membership Test

Since an array element can be created simply by referring to it by name it is not possible to test for the existence of a particular element via a statement of the form:

     if (arr[1] == "") ....

The reference to arr[1] will create the element if it doesn't already exist and assign it the default variable value of a NUL string. The above test is therefore always TRUE for an element that did not previously exist, and the element does exist once the test has been made.
A special format of the if statement exists within AWK for the purpose of testing an array element for existence:

     if ("1" in arr) ....

In the above example if the array element arr["1"] exists the statement will be TRUE otherwise it will be FALSE. If the element doesn't exist it will not be created by this statement. The membership test can be used to test for members of multidimensional arrays by using the following format:

     if ((i, j) in arr) ....

Associative Arrays - Element Enumeration

An array in most conventional languages is pre-defined to the compiler or interpreter and restricted to certain bounds. In general, either 0 or 1 is implicitly defined as the lower bound and the upper bound is programmer defined. In either case the subscript value for all elements is known as the range of numbers from the lower to the upper bound. In AWK this is not the case as the subscripts in use are arbitrary strings and need not form a contiguous range. AWK provides a variation of the "for" statement which allows all active subscripts within an array to be enumerated. The format of this statement is:

     for (key in arr) ....

This loop will be executed once for each element of the array "arr". On each iteration of the loop the scalar variable "key" will be assigned the value of the current array subscript. Therefore, the code:

     for (key in arr)
          print "arr[" key "]=", arr[key]

will print out all the elements of array "arr". This version of the "for" statement does not support multidimensional array notation for subscripts; however, it can be used on multidimensional arrays since, as previously mentioned, they are really stored as single dimension arrays with concatenated subscript values. If the individual subscript elements need to be accessed they can be obtained via the split() builtin function.
For example:

     arr[1, 1] = 1; arr[1, 2] = 2; arr[1, 3] = 3
     for (i in arr) {
          split(i, x, SUBSEP)
          print "arr[" x[1] "," x[2] "]=", arr[x[1], x[2]]
     }

Associative Arrays - Example

We will leave this discussion of AWK arrays by presenting an example of their use which, I believe, will demonstrate how powerful they can be. The following short AWK program will read any number of text files specified on the command line and produce a report of the number of lines in each file:

     { cnt[FILENAME]++ }

     END {
          for (i in cnt) {
               printf("File %-16s %5d line%s\n", i, cnt[i],
                      cnt[i] == 1 ? "" : "s")
          }
     }

Please note that the majority of the code in this example is concerned with displaying the output of the program. The actual work of counting the lines within each file is performed with a single AWK statement.

REFERENCES

Aho, Alfred V., Brian W. Kernighan, and Peter J. Weinberger [1988], "The AWK Programming Language", Addison-Wesley Publishing Company, 1988.

Downs, Brian W. [1989], "AWK Comes of Age, Part 1", Unix World, January 1989, pp 103-109.

Downs, Brian W. [1989], "AWK Comes of Age, Part 2", Unix World, February 1989, pp 115-122.

Kernighan, Brian W., and Rob Pike [1984], "The UNIX Programming Environment", Prentice-Hall, 1984.

Tare, R. S. [1987], "UNIX Utilities", McGraw-Hill, 1987.

CREDITS

This package was originally developed in cooperation with the GNU Project headed by Dr. Richard Stallman. It has been enhanced and modified by numerous authors and is distributed under the guidelines of the Free Software Foundation. These guidelines may be found in a separate file named "COPYING". To the best of my knowledge all of the authors of this package agree with this distribution policy and fully support the free distribution of software in source code form.

The original version of gAWK was developed by Paul Rubin in 1986 and released to the GNU Project.
The original version of the gAWK builtin functions was written by Jay Fenlason in 1986. The enhancements for range patterns and various other fixes were made by a programmer identified only as "jfw". Numerous fixes were applied by a programmer identified only as "JF".

All of the newer features of AWK were implemented by Bob Withers. The code was also ported to both MSDOS and OS/2 systems under Microsoft C V5.10.

The AWK grammar for this release was processed by the PD version of YACC which was originally developed by J van Katwijk of The Delft University of Technology, Delft, The Netherlands. This code has been extensively modified and ported by Bob Denny, Scott Guthery, and Bob Withers among others.

There are, I'm sure, other hands through which this code has passed on its way to me but I have not been able to identify them. To those programmers I apologize for the omission and express thanks for their efforts.